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1  Introduction 


The  assessment  of  statistical  procedures  often  requires  the  evaluation  of  error  proba¬ 
bilities  that  can  be  written  as 

e  =  j (jwdx,  (i) 

where  /  is  a  probability  density,  and  A  C  Rv  is  a  tail  region  typically  of  the  form  {x  €  Rp  : 
xi  >  a»  ,  i  =  1  >  •  •  • ,  p}.  When  there  is  some  special  structure  in  such  a  problem,  analytically 
tractable  approximations  or  inequalities  are  available  ([10], [15]);  when  this  is  not  the  case, 
however,  Monte  Carlo  methods  are  often  the  only  available  option,  especially  for  large 
p  ([14]).  Direct  Monte  Carlo  uses  the  average  of  n  independent  replicates  of  I(X  €  A), 
where  X  is  a  random  vector  with  density  /  and  1(E)  is  the  indicator  function  of  the 
event  E.  This  unbiased  estimate  of  6  has  variance  (5  —  0a)/n,  and  squared  coefficient 
of  variation  ( 6  —  03)/n0 ;  thus,  when  0  is  small,  inordinately  large  n  are  needed  to  get 
a  sufficiently  accurate  estimate.  In  such  cases,  importance  sampling  ([7], [11])  is  a  useful 


variance  reduction  technique  which  uses  the  expression 


9=f  ^-g(x)dx,  (2) 

Ja  g(x) 

for  some  "sampling  density”  g.  This  leads  to  another  unbiased  estimator  which  is  the 
average  of  n  independent  replications  of  £  A)  (where  Y  has  density  g)  with 

variance  (JA  —  62)/n.  The  problem  then  is  to  find  the  sampling  density  which 

has  a  variance  that  is  substantially  smaller  than  that  of  direct  Monte  Carlo  (that  is,  for 
which  f  /g  is  close  to  constant  on  A).  In  practice,  generating  observations  from  g  and 
evaluating  the  ratio  f(x)/g(x)  should  also  be  easy;  otherwise,  any  savings  in  variance 
icduction  could  be  oifset  by  the  added  cost  of  doing  these  calculations.  This  reduction 
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in  variance  can  be  used  in  two  ways:  we  can  achieve  a  desired  accuracy  with  a  reduced 
sample  size,  or  we  can  get  a  more  accurate  estimate  with  the  same  sample  size.  There 
is  a  large  literature  on  importance  sampling  and  other  variance  reduction  techniques:  for 
entries  into  this  literature,  see  for  example  [1],[2],[3],[6],  and  [9];  for  recent  applications  to 
engineering,  see  the  references  in  [16]. 

It  is  well  known  ([7])  that  the  choice  g(x)  =  f(x)I(x  E  A)/ 6  gives  an  estimate  with 
zero  variance,  but  this  observation  is  of  limited  use  since  it  depends  upon  the  unknown 
9.  An  alternative  approach  is  to  restrict  attention  to  a  parametric  family  of  functions  g, 
and  to  choose  an  optimal  one  (or  nearly  so)  from  that  family.  In  this  paper,  we  study 
this  technique  for  the  evaluation  of  tail  probabilities.  Our  interest  is  in  the  accurate 
and  efficient  evaluation  of  very  small  probabilities;  thus,  we  seek  methods  with  bounded 
coefficient  of  variation  as  the  probability  decreases  (usually  as  the  region  of  integration 
goes  to  infinity),  and  we  assess  their  performance.  Section  2  contains  our  notation  and 
some  preliminaries.  Section  3  contains  the  univariate  case  and  provides  a  heuristic  for 
construction  sampling  density  families.  Section  4  contains  the  multivariate  case,  and 
provides  interesting  connections  of  this  work  with  Mills’  ratio,  for  which  we  provide  an 
alternative  definition.  Section  5  contains  proofs  of  selected  results  of  this  paper. 


2  Notation  and  Preliminaries 


Given  an  unbiased  estimate  8  of  0,  our  criterion  for  assessing  its  performance  ;s  the 
squared  coefficient  of  variation  (cv3): 


,2fa\  1  War(0)  _  1  /M*!2 


cv3(9)  = 


n  03  n  03 


(3) 


This  expression  for  cu 2  allows  us  to  consider  only  the  case  n  =  1,  and  study  E(92)  as 
a  function  of  the  set  A.  Let  Np(0,  E)  denote  a  normal  distribution  with  mean  0  and 
covariance  E  =  (pij)  with  p„  =  1;  when  E  is  nonsingular,  4>p(x\ E)  is  the  corresponding 
density.  When  p  =  1,  let  4>(x)  and  <f>  denote  the  standard  normal  distribution  and  density, 
respectively.  A  NP(Q,  I)  variate,  where  /  stands  for  the  identity  matrix,  will  be  denoted 
Z ;  its  dimension  will  be  clear  from  the  context. 

A  function  which  appears  in  many  calculations  below  is  Mills’  ratio  ([4]): 

M(x)  =  figi  (4) 

and  its  multivariate  version  ([12]) 


M(x;E) 


P(X  >  x ) 
«*;2)  * 


(5) 


where  X  is  a  Np{ 0,  E)  variate,  and  x  >  y  means  ij  >  for  all  *.  The  properties  of  these 


functions  that  we  need  are  given  in  the  appendix.  In  Section  4,  we  introduce  another 


definition  of  a  multivariate  normal  Mills’  ratio  and  study  its  properties  there. 


3  The  Univariate  Case 

Even  though  importance  sampling  is  much  more  important  for  high- dimensional  prob¬ 
ability  integrals,  we  begin  with  a  treatment  of  the  univariate  case  for  several  reasons.  First, 
it  allows  for  a  careful  analytical  examination  of  the  degree  of  improvement  that  impor¬ 
tance  sampling  affords.  Next,  it  provides  guidelines  for  choosing  families  of  sampling 
density  for  the  multivariate  case.  Finally,  our  discussion  here  provides  justification  of 
some  of  the  results  in  [16]. 
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We  start  with  normal  tail  probabilities:  let  6  =  $(— a)  for  large  positive  a,  for  which 
direct  Monte  Carlo  has  cv 2  =  $(— a)-1  —  1.  Two  families  of  sampling  density  that  have 
been  suggested  in  this  case  ([16])  are  {<f>(x— fi)  :  fi  £  R}  and  the  :  fi  6  R,  o  >  0}. 

For  the  first  family,  after  a  change  of  variables  we  have 

B  =  e-*'2  f°°  e-^a :)dx ,  (6) 

suggesting  the  estimator 

=  tT*,%e-»*IiZ  >a-n),  (7) 

which  has 

Eft)  =  i-o).  (8) 

A  special  case  is  fi  =  a:  since  M(x)  ~  1/x  for  large  x,  we  have 

cv’(0a)  ~  yj~a  -  1  (9) 


It  turns  out  (proofs  of  (10  to  (14)  are  in  the  appendix)  that  the  optimal  members  of  each 
of  the  two  sampling  density  families  do  not  provide  much  more  of  an  improvement  as 
a  — ►  oo.  For  the  first  family,  the  optimal  value,  ft*,  of  n  satisfies  2 fiM(a  +  fi)  =  1,  am* 
since  M(x)  >  -—r  for  x  >  0,  a  <  fi*  <  a  +  \/l  +  a3,  so  that 


var(B „.) 
var(Ba ) 


— *■  1  as  a 


oo. 


(10) 


For  the  second  family,  the  optimal  values  are  fi*  =  (a  +  \/a2  +  2)/2  and  a*  =  l/v/2»  so 


that 


var(gM.t<r.)  1 
var(6* )  y/2 


oo. 


(11) 
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Thus,  the  optimal  members  of  these  sampling  density  families  provide  a  moderate  im¬ 
provement  over  <j>(x  —  a),  which  provides  a  substantial  improvement  over  direct  Monte 
Carlo. 


Notice,  however,  that  the  optimal  estimators  above  (and  6a)  each  have  coefficient  of 
variation  increasing  at  the  rate  y/a  as  a  increases,  so  that  the  relative  error  increases 
without  bound  as  the  threshold  increases.  In  [16],  the  exponential  density  is  shown  to 
overcome  this  problem.  Writing 


e  = 


(12) 


yields  the  estimator 


0  =  Me-*’/*v 


(13) 


where  X  has  the  exponential  density  e  *  for  x  >  0.  We  have 


M(a)2ay/ 2  a2 


as  a  — ♦  oo  , 


(14) 


so  that  the  relative  error  actually  decreases  with  the  tail  probability. 

More  generally,  a  heuristic  method  for  choosing  a  sampling  density  (based  on  a  "sim¬ 
ulacrum”)  is  provided  in  [16]  for  such  problems.  Here,  we  propose  an  alternative  heuristic 
based  on  some  elementary  calculations.  If  X  has  density  /,  l’Hopital’s  rule  says  that 
with  suitable  regularity,  the  asymptotic  behavior  of  P(X  >  a)/ f(a)  is  the  same  as  that 
of  — /(a)//'(a).  Using  a  tractable  approximation  r(a)  for  — /(a)//'(a),  we  get 


«  =  p(X>  a)  =  r(a)/(a)  /"  =  r(a)/(a)  /“  (15) 

Under  the  same  regularity  conditions,  the  last  integral  in  (15)  approaches  1  as  a  — ►  oo; 
thus,  it  is  bounded  away  from  0,  and  estimating  it  with  good  relative  accuracy  can  be  done 
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using  importance  sampling.  In  fact,  it  is  a  rather  remarkable  fact  that  the  phenomenon 
observed  in  (14)  is  quite  general:  for  a  wide  class  of  problems,  the  coefficient  of  variation 
not  only  remains  bounded,  but  it  goes  to  zero  ,  and  hence  the  relative  accuracy  improves 
as  the  threshold  a  increases.  In  addition,  this  method  is  feasible  since  the  calculation  of 
r(a)  depends  on  the  differentiation  of  the  density  rather  than  its  integration;  since  the 
tail  behavior  F  is  already  captured  by  r(a)/(a),  the  evaluation  of  the  remaining  integral 
by  Monte  Carlo  just  provides  a  correction  term. 

In  practice,  we  use  (15)  or  either  of  the  following  two  expressions  for  6: 


=  *•(<*)/(«)  J 


°°  /(  f  +  <0 

i  ar(a)/(a)‘ 


=  r{a)f(a) J 


00  a/(a(x  +  1)) 
r(a)/(a) 


and  use  as  a  sampling  density  one  that  matches  the  tail  behavior  of  the  integrand.  The 
calculation  for  the  normal  in  (12-14)  uses  the  first  expression  in  (16).  Instead  of  giving 
a  general  result,  we  present  some  examples  to  illustrate  the  calculations  (in  all  examples, 
cv 2  tends  to  0  as  a  tends  to  oo). 

Example  1:  Let  fp(z)  =  x0e~x/T((3)  so  that  r(a)  =  ~  1,  and  using  (15)  we  have 


*  =  /«(")  /”(-  +  lfe-dx 
Jo  a 


(17) 


yielding  the  estimator 

«  =  /«(■>)(-  +  l)" .  (18) 

a 

where  X  has  a  standard  exponential  distribution.  Notice  that  the  tail  of  "generalized 
Gaussian  distribution”  ([16])  with  density  proportional  to  exp(— x~0/{3)  can  be  reduced 
to  this  case  by  a  change  of  variables.  Notice  also  that  when  (3  =  0,  the  variance  is 
identically  0. 
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Example  2:  For  the  t  distribution  with  k  degrees  of  freedom,  the  density  /t(x)  is 


proportional  to  (1  + 


so  that  using  the  second  equation  in  (16)  we  have 


G  °f  (,\\(k  +  a2)X2i^ 

6 ~  kMa)[(kTw)]  1 


(19) 


where  X  has  density  k/xk+1  for  x  >  1.  A  close  examination  of  this  example  with  k  =  1 
(Cauchy  distribution)  shows  that  this  method  works  even  when  l’Hopital’s  rule  does  not 
give  the  right  constant  in  the  asymptotics:  in  this  case  the  two  differ  by  a  factor  of  2, 
r(a)/(a)  ~  l/2ira,  while  P(X  >  o)//(a )  ~  1/ira. 


The  use  of  (6)  with  n  —  a  is  an  instance  of  what  is  called  "improved  importance 
sampling”  (IIS)  [16];  we  now  show  that  the  method  based  on  l’Hopital’s  rule  described 
above  is  better  than  IIS.  Let  T  >  0  be  a  random  variable  with  density  /,  which  is 
decreasing;  then  IIS  writes 


6  =  Ia  f(x)dx  =  JQ  ^'/(aj ~f(x)dx  (20) 

to  suggest  the  estimator  Qa  —  f(T  +  a)//(T).  Now  suppose  that  /(x  +  a)//(a)  <  h(a )  for 
some  function  h  <  1  for  all  x  >  0.  When  /  is  normal,  h(a)  =  e~°3/a;  when  /  is  logistic, 
h(a )  =  e~°;  and  when  /  is  a  t  density  with  k  degrees  of  freedom,  h(a)  =  1.  Since 

=  Jq  Kx  +  a)dx  -  Ja  Ka)f(x)dx  =  h(a)d »  (21) 


we  have 


and  <  h(a) , 

8  0  var(60) 


(22) 


where  60  is  the  direct  Monte  Carlo  estimate.  Thus,  even  though  IIS  is  an  improvement 
over  direct  Monte  Carlo  for  such  problems,  (22)  shows  that  unless  (as  a  — ►  oo)  h  goes  to 
0  faster  than  9  does,  the  IIS  estimator’s  coefficient  of  variation  will  tend  to  infinity. 
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4  The  Multivariate  Case 


We  now  turn  to  the  multivariate  case,  concentrating  on  the  multivariate  normal.  For 
this  discussion,  we  principally  deal  with  the  spherically  symmetric  case,  E  =  I.  There 
is  no  loss  of  generality  because  of  the  following  simple  transformation:  if  X  £  R?  has 
covariance  matrix  E,  then  P(X  G  A)  =  P{ E~iX  G  E-JA),  and  E'JX  is  spherically 
symmetric;  also,  a  simulation  which  generates  multivariate  observations  usually  starts  by 
generating  the  spherically  symmetric  variates. 

Now  suppose  that  Z  is  a  1VP(0, 1)  variate  and  A  6  Rp  is  a  closed  convex  set  (such  as  an 
orthant,  a  rectangle,  or  a  sphere).  Suppose  that  A  does  not  contain  the  origin;  then  there 
is  a  point  a  £  A  that  is  closest  (in  ordinary  Euclidean  distance)  to  the  origin:  |  o  |<|  x  \ 
for  all  x  €  A.  As  before,  let  8  =  P(X  G  A).  Analogous  to  (6)  we  have  ti:e  estimators 

60  =  I(Z  G  A)  (23) 

and 

8a  =  e-W/2e-a'zI(Z  G  A -a),  (24) 

where  A  —  a  =  {x  —  a  :  iG  A}. 

We  first  show  that  8a  is  better  than  90.  Since  A  is  convex  and  does  not  contain  0,  A 
is  contained  in  the  half  space  {x  :  a'x  >  a'a}.  Thus, 

E(Bl)  =  ea'aP{Z  G  A  +  a)  =  ea'a/ 2  [  e~a'x<l>p{x ;  I)  dx  <  e~a'al2B  =  e~a'a/2 E(620) .  (25) 

Thus,  as  |  a  |— »  oo,  Ba  provides  a  vast  improvement  over  B0.  The  convexity  of  A  is 
crucial,  as  seen  by  the  following  counterexample.  Let  A  =  {x  G  R2  :|  x  |>  r},  and  let 
6  =  P(Z  G  A),  so  that  the  direct  Monte  Carlo  estimate  has  second  moment  8.  If  we  use 
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the  sampling  density  <f>p(x  —  a;  /),  where  a  is  any  point  on  the  circle  {x  £  R2  :j  x  j=  r},  a 
routine  calculation  shows  that  the  new  estimator  has  second  moment  er3/2P(x2(l)  ->  r2), 
which  is  at  least  er  /JP(x2  ^  r2  =  eri^9,  since  the  non-central  chi-square  is  stochastically 
larger  than  the  central  one  (x3(V,a)  denotes  a  non-central  chi-square  variate  with  non¬ 
centrality  parameter  V'2)- 

It  can  be  shown  that  analogous  to  (14),  cv2^,,)  increases  at  a  rate  proportional  to 
|  a  |p,  and  so  we  generalize  the  approach  leading  to  (15)  to  get  estimators  with  bounded 
cv 2  after  some  preparation.  We  introduce  here  a  definition  of  a  multivariate  Milk’  ratio, 
which  arises  in  our  work  below.  Let  A  and  a  be  as  before,  and  let 

M(A,I)  =  P(Z  €  A)/<f>p(a\I).  (26) 

Notice  that  we  have  set  E  =  I  in  (26);  alternatively,  we  can  define 

M(yl;E)  =  P(*€  A)/^,(a;E).  (27) 

for  X  a  Np(0t  £)  vanate,  where  a  £  A  minimizes  the  Mahalanobis  distance  from  the 
origin.  That  is, 

a'Z-'a  <  x'E-lx  (28) 

for  all  x  €  A.  For  a  given  A,  finding  a  is  a  quadratic  programming  problem,  and  can 
be  solved  by  standard  iterative  methods  [5].  Now  consider  the  case  in  which  A  is  the 
orthant  {x  €  Pp  :  x  >  b}  for  some  b  €  Rp.  For  evaluating  P(X  g  j4)  where  X  is  a 
Np(0,E)  variate,  the  sampling  density  <f>p{x  —  6;/ )  has  been  suggested  ([2]).  In  addition, 
the  multivariate  generalization  of  Mills’  ratio  proposed  in  [12]  is  P( X  €  A)/<f)p{x  —  b\  E), 
and  approximations  to  it  have  been  studied  in  [8], [13].  We  now  propose  new  alternatives 
to  each  of  these. 
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First,  instead  of  using  <j>p( x  —  b\  E),  (25)  suggests  the  use  of  the  sampling  density 
<t>p( x  —  a;  E),  where  a  is  given  by  (28).  A  simple  example  clarifies  the  8'tuation.  Suppose 
that  p  =  2,  A  =  {x  :  ii  >  a!,ia  >  0}.  If  corr(Xi,  X3)  =  p,  then  a  =  (ai,aip)  while 
b  =  (ai,0).  By  (25),  the  use  of  a  is  better  than  direct  Monte  Carlo;  furthermore,  it  can 
be  shown  here  that  the  use  of  a  is  better  than  that  of  b,  and  that  the  use  of  b  can  actually 
be  worse  than  direct  Monte  Carlo  for  sufficiently  large  p. 

More  generally,  for  any  convex  A  not  containing  the  origin,  for  E  =  I  we  can  rotate 
;he  axes  to  get  a  =|  a  |  ei,  where  e\  is  the  unit  vector  in  the  x\  direction,  so  that  A  is 
contained  in  the  half  space  {x  :  ij  >|  a  |}. 

P{Z  €  A)  =  4>p(a]  I)  f  e-Wx/2dx  (29) 

JA-a 

or  after  some  reductions, 

P(Z  €  A)  =  f  c-*?/Jl°lexp(_t1  _  y'yl2)dhdy  (30) 

|  a  |  JT.{A-a) 

where  y  =  (t2, . . .  ,tp)  and  T0  is  the  matrix  diag(\  a  |,  1, . . . ,  1). 

The  expression  (30)  yields  several  results.  First,  it  shows  that  the  sampling  density 
should  be  proportional  to  exp(— 1\  —  y'y/ 2);  using  methods  similar  to  those  leading  to 
(14),  it  can  be  shown  that  the  coefficient  of  variation  decreases  to  zero  if  A  —  a  is  a  cone, 
or  if  Ta(A  —  a)  =  A  —  a.  Next,  it  provides  an  inequality  for  our  Mills’  ratio. 

Proposition  1  :  For  corner  A, 

e-lalJ/2 

Af(A;/)<-=f-.  (31) 

V  2*  |  a  | 

An  asymptotic  expansion  analogous  to  that  of  the  one-dimensional  case  can  be  derived 
also  by  expanding  the  integrand  e“‘*/3lal  in  (30)  and  integrating  term  by  term. 
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5  Appendix 


Proof  of  (10):  Let  g(fx)  be  E(0*)  from  (8);  to  find  the  optimal  value  g,*  of  fi,  we  solve 


g'(fx)  =  0  or 


-f  a)  =  1. 


Clearly,  y.*  >  0;  next,  if  0  <  fx  <  a  then 

<33> 

so  that  fi*  >  a.  Next,  since  for  x  >  0  we  have  M(x)  >  y^y, 

~  =  M(a  +  /x*)  >  ,  a.+  h*  ,  (34) 

2fi*  v  ^  ‘  1  +  (a  +  (i *)2  ’  v  ' 

we  have  fx*  <  y/d1  +  1  <  a  +  yk  To  show  the  asymptotic  equivalence  of  and  0a,  note 

that  we  only  need  consider  second  moments  instead  of  variances.  Using  (32),  to  get  the 

equation  below,  and  writing  /i  for  fx*}  we  have 

E(y  =  Sjg+p  „  «  ( 1  (  _  a)1)  (35) 

0a  2fie*  $(—2a)  n  V2V  11  ' 

which  tends  to  1  as  a  — ►  oo. 

Proof  of  (11):  Writing  r  =  1/a,  we  have 


#-r 

Jr(o-M)  T  <£(l) 


For  derived  from  this  expression, 


e&j  — (2  ~  ;‘)a + ^ ). 

i  v/2  —  ra  v2-t3'  v  v/2  -  r2  ' 


Using  the  same  methods  as  in  the  proof  of  (10),  it  can  be  shown  that  the  optimal  values 
of  n  and  r  satisfy  (*)  o  <  fi*  <  a  +  *k  and  1  <  r*  <  2.  Writing  p  and  r  for  n*  and  r*, 
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and  using  the  approximation  M(x)  ~  1/x,  it  is  easy  to  show  that 


g(jy 

E(6l) 


(38) 


Since  fi  —  a  <  ^  and  r  >  1,  the  ratio  in  (38)  is  bounded  between  1/2  and  1  as  a  — ►  oo, 
so  that  the  optimal  member  of  this  family  is  asymptotically  equivalent  (up  to  a  constant) 
to  6a.  In  fact,  more  involved  calculations  show  that  the  optimal  values  are  those  given 
above  in  (11). 
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